White Wine quality by Pauline Vercruysse

1. Citation

This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016

2. About the dataset

This report explores a dataset containing quality ratings and attributes of about 5,000 white wines. The inputs are physicochemical test results (e.g. pH or citric acid) and the output is sensory data on the wine quality (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).

The dataset is related to a white variant of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

3. Attribute information:

The dataset consists of 12 variables about attributes, with almost 4,900 observations.

For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
   1 - fixed acidity (tartaric acid - g / dm^3)
   2 - volatile acidity (acetic acid - g / dm^3)
   3 - citric acid (g / dm^3)
   4 - residual sugar (g / dm^3)
   5 - chlorides (sodium chloride - g / dm^3)
   6 - free sulfur dioxide (mg / dm^3)
   7 - total sulfur dioxide (mg / dm^3)
   8 - density (g / cm^3)
   9 - pH
   10 - sulphates (potassium sulphate - g / dm^3)
   11 - alcohol (% by volume)
Output variable (based on sensory data): 
   12 - quality (score between 0 and 10)

4. Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavour to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of the wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Analysis

Summary of the dataset

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Observations from the Summary

I am not focusing on all the variables. I will exclude the density, which depends on the combination of alcohol and sugar; the free sulfur dioxide, which is part of the total sulfur dioxide (total SO2 = free SO2 + bound SO2); and the fixed acidity, which measures acidity in a way similar to pH. According to one source: “Fixed acidity is measured as total acidity minus volatile acidity. Generally, pH is a quantitative assessment of fixed acidity.”

In the remaining parameters, we can already observe some interesting facts. Details for each parameter will be given in the single variable study.
- The mean residual sugar is 6.391 g/L, but the maximum is 65.8 g/L. This wine is above the 45 g/L level at which wines are considered sweet and appears to be an outlier.
- The mean level of chlorides is 0.046, with a maximum of 0.346. This maximum point might be an outlier.
- The mean total sulfur dioxide is 138 ppm, with 75% of the wines at or above 108 ppm (first quartile). Over 50 ppm, the sulfur dioxide will have an impact on the taste of the wine. This parameter might have an impact on the quality.

Univariate Plots Section

Quality of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The most frequent quality score is 6, and the observed scores range from 3 to 9. The wine quality has a roughly normal (bell-shaped) distribution.

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The mean volatile acidity is 0.2782 g/L (median 0.26), with a maximum of 1.1 g/L, which corresponds to the US legal limit for white wine. Most wines have a volatile acidity between 0.15 g/L and 0.35 g/L, with a high peak at 0.25 g/L (first graphic) and a roughly normal, slightly right-skewed distribution. With a smaller bin width (second graphic), we observe that there is not a single peak but a high count of wines between 0.24 g/L and 0.28 g/L. We know that too much volatile acidity is bad for the wine, so it will be interesting to study the link between this parameter and the quality.

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The mean level of citric acid is 0.3342 g/L (median 0.32). Some wines have no citric acid, but they are only a few, since the first quartile is already at 0.27 g/L. Most of the wines have between 0.2 g/L and 0.55 g/L, with a roughly normal distribution and few wines with a concentration higher than 1 g/L. The presence of citric acid is a benefit, bringing ‘freshness’ and flavour to wines, and it will be interesting to see its impact on the quality.

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Looking at the summary table, we see that there is one outlier (65.8 g/L), so I limited the data to wines with residual sugar less than or equal to 45 g/L. The distribution is skewed, so I used a log10 scale on the x-axis for a second graph.

We can see that the residual sugar concentration has a bimodal distribution, meaning that there are two different groups: dry (not sweet) white wines (1 to 4 g/L) and slightly sweet white wines (4 to 19 g/L).

For this reason, a new variable sugar_category is created with the ifelse function, using a limit of 4 g/L between dry and slight_sweet (wines at exactly 4 g/L are excluded from the dry group).

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality sugar_category
## 1       6   slight_sweet
## 2       6            dry
## 3       6   slight_sweet
## 4       6   slight_sweet
## 5       6   slight_sweet
## 6       6   slight_sweet
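
The categorisation described above can be sketched as follows. This is a minimal Python illustration, not the R ifelse call used for the report; the function name sugar_category and the constant DRY_LIMIT are assumptions introduced here, and the six input values are the residual sugar readings of the first six rows shown above.

```python
# Threshold (g/L) separating the two groups. Wines at exactly 4 g/L
# fall in the slight_sweet group, as stated in the text.
DRY_LIMIT = 4.0


def sugar_category(residual_sugar):
    """Return 'dry' below the limit and 'slight_sweet' otherwise."""
    return "dry" if residual_sugar < DRY_LIMIT else "slight_sweet"


# Residual sugar (g/L) of the first six wines in the table above.
first_six = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9]
categories = [sugar_category(s) for s in first_six]
print(categories)
```

Only the second wine (1.6 g/L) falls below the limit, which agrees with the sugar_category column printed above.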

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The maximum value (0.346) might not be an outlier; however, very few wines have more than 0.10. The next graphic focuses on the wines with a level lower than or equal to 0.10.

The mean level of chloride is 0.046 (median 0.043), with most of the wines having a level between 0.025 and 0.06 with a normal distribution.

Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

As said before, the mean total sulfur dioxide is 138 ppm (median: 134 ppm), with most of the wines between 60 and 220 ppm and a roughly normal distribution. Over 50 ppm, the sulfur dioxide will have an impact on the taste of the wine. This parameter might have an impact on the quality.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

All wines have a pH between 2.7 and 3.8, with a mean of 3.2. None of the wines is basic (no pH higher than 7).

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The mean level of sulphates is 0.49 g/L (median 0.47) with a normal distribution.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The alcohol distribution is right-skewed, with a peak around 9.5% and a mean of 10.5%.

Univariate Analysis

What is the structure of your dataset?

There are 4898 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality).

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is the quality. I would like to determine which features are influencing the quality of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

After researching information about wine, I think that residual sugar, alcohol, volatile acidity and citric acid contribute most to the quality.

Did you create any new variables from existing variables in the dataset?

I have observed that the distribution of residual sugar is bimodal, meaning that there are two different groups. I have created a new variable sugar_category with two classes: dry (less than 4 g/L) and slight_sweet (4 g/L or more).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of residual sugar is right-skewed, so I plotted it on a log10 x-axis, which revealed that the distribution is bimodal.

Bivariate Plots Section

Selection of variables

I select the variables chosen at the end of the Summary part, excluding the density (which depends on the combination of alcohol and sugar), the free sulfur dioxide (part of the total sulfur dioxide) and the fixed acidity (similar to pH). We can measure the correlation coefficients to be sure.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376
## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665
## 
##  Pearson's product-moment correlation
## 
## data:  wine$free.sulfur.dioxide and wine$total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501
## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$pH
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4485154 -0.4026542
## sample estimates:
##        cor 
## -0.4258583
##   volatile.acidity citric.acid residual.sugar chlorides
## 1             0.27        0.36           20.7     0.045
## 2             0.30        0.34            1.6     0.049
## 3             0.28        0.40            6.9     0.050
## 4             0.23        0.32            8.5     0.058
## 5             0.23        0.32            8.5     0.058
## 6             0.28        0.40            6.9     0.050
##   total.sulfur.dioxide   pH sulphates alcohol quality sugar_category
## 1                  170 3.00      0.45     8.8       6   slight_sweet
## 2                  132 3.30      0.49     9.5       6            dry
## 3                   97 3.26      0.44    10.1       6   slight_sweet
## 4                  186 3.19      0.40     9.9       6   slight_sweet
## 5                  186 3.19      0.40     9.9       6   slight_sweet
## 6                   97 3.26      0.44    10.1       6   slight_sweet
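
The t values printed by the correlation tests above follow directly from each coefficient r and the degrees of freedom (n - 2 = 4896) through the standard relation t = r * sqrt(df) / sqrt(1 - r^2). The sketch below, written in Python rather than the R used for the report, recomputes them from the reported coefficients; the function name t_statistic and the dictionary layout are assumptions introduced here.

```python
import math


def t_statistic(r, df):
    """t statistic for testing a Pearson correlation r against zero (df = n - 2)."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)


DF = 4896  # 4898 wines minus 2

# Correlation coefficients copied from the test output above.
reported_r = {
    "density vs alcohol": -0.7801376,
    "density vs residual sugar": 0.8389665,
    "free vs total sulfur dioxide": 0.615501,
    "fixed acidity vs pH": -0.4258583,
}

t_values = {pair: t_statistic(r, DF) for pair, r in reported_r.items()}
for pair, t in t_values.items():
    print(f"{pair}: r = {reported_r[pair]:+.4f}, t = {t:.3f}")
```

With almost 4,900 observations, even the weakest of these coefficients gives a very large |t|, which is why every p-value is reported as below 2.2e-16. The size of r, not the p-value, is what indicates how redundant two variables are.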

Scatterplot matrix with a subset of samples (1000)

## Warning in ggcorr(sample, hjust = 0.75, size = 3, label = TRUE, label_size
## = 3, : data in column(s) 'sugar_category' are not numeric and were ignored

We can see some correlations like:
* total sulfur dioxide vs residual sugar (moderate positive correlation)
* alcohol vs residual sugar (moderate negative correlation)
* alcohol vs chlorides (small negative correlation)
* alcohol vs total sulfur dioxide (moderate negative correlation)
* quality vs alcohol (moderate positive correlation)

First, I will have a look at the correlations observed in the scatterplot matrix between the objective parameters (excluding quality). Second, I want to look closer at plots involving quality and some other variables like alcohol, volatile acidity, residual sugar and citric acid. Indeed, in the original paper (http://dx.doi.org/10.1016/j.dss.2009.05.016) these factors were considered to take part in the model of quality.

Compare objective parameters of wines

Residual sugar vs Total sulfur dioxide & Residual sugar vs Alcohol

Comparing residual sugar vs total sulfur dioxide or alcohol, the first plots suffer from overplotting, a poorly suited x scale, and one outlier. Adding jitter and transparency, changing the x scale to log10, changing the y limits, and excluding the sugar outlier (with subset) lets us see the moderate correlations calculated before. I add a linear regression line to visualise them better. Moreover, I have created two groups based on the sugar level, with a limit of 4 g/L. With the vertical line we can observe that for total sulfur dioxide vs residual sugar, both groups have the same tendency, while for alcohol vs residual sugar, the two groups seem to have different patterns.

Alcohol vs Chlorides & Alcohol vs Total sulfur dioxide

Comparing alcohol vs total sulfur dioxide or chlorides, the first plots suffer from overplotting and a large spread of points. Adding jitter, transparency and smaller points, and changing the y limits, lets us see the moderate correlations calculated before. I add linear regression lines to visualise them better. For chlorides, I left the top 5% of values out of the graphic, and for total sulfur dioxide I left out the top 1% of values.
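
One common way to leave the highest values out of a plot is to cut at an upper percentile. The sketch below is a Python illustration of that idea, not the code used for this report; the helper drop_top_fraction is hypothetical, and the sample (the integers 1 to 100) is made up purely to make the cut points easy to check.

```python
import statistics


def drop_top_fraction(values, percent):
    """Keep the values at or below the (100 - percent)th percentile.

    `percent` is a whole number between 1 and 99.
    """
    # 99 cut points; index k - 1 holds the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    limit = cuts[100 - percent - 1]
    return [v for v in values if v <= limit]


# Hypothetical sample standing in for a measured variable.
sample = list(range(1, 101))

kept_95 = drop_top_fraction(sample, 5)  # drop the top 5% (as for chlorides)
kept_99 = drop_top_fraction(sample, 1)  # drop the top 1% (as for total SO2)
print(len(kept_95), max(kept_95))
print(len(kept_99), max(kept_99))
```

Trimming only affects what is displayed; the correlation coefficients reported earlier were computed on all 4898 wines.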

Quality and other variables

Quality vs alcohol

## subtable$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## subtable$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## subtable$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## subtable$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## subtable$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## subtable$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## subtable$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
## 
##  Pearson's product-moment correlation
## 
## data:  subtable$alcohol and subtable$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

There is a correlation between alcohol and quality. There seems to be a threshold at an alcohol level of about 11% that separates the lower-quality wines (3 to 6) from the higher-quality wines (7 to 9).

Quality vs volatile acidity

## subtable$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## subtable$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## subtable$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## subtable$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## subtable$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## subtable$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## subtable$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

The volatile acidity corresponds to the amount of acetic acid in wine, and at a high level it can lead to an unpleasant taste. I was expecting that higher-quality wines would have a lower level of volatile acidity. Surprisingly, all quality groups cover a similar range of volatile acidity. In this dataset, the volatile acidity does not seem to influence the quality rating.

Quality vs residual sugar

The outlier of residual sugar is directly excluded from the graphic.

## subtable$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## -------------------------------------------------------- 
## subtable$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## subtable$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## subtable$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## subtable$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## subtable$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## subtable$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60
## 
##  Pearson's product-moment correlation
## 
## data:  subtable$residual.sugar and subtable$quality
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683

It doesn’t look like the higher-quality wines have a particular level of residual sugar. We can say that the residual sugar level is not a major component influencing the wine quality (see the correlation coefficient). However, in the dataset, I split the wines into two categories depending on their residual sugar level. We observe that both groups have a similar distribution of quality (graphic below) and that their means are really close.

## subtable$sugar_category: dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.95    7.00    9.00 
## -------------------------------------------------------- 
## subtable$sugar_category: slight_sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.824   6.000   9.000

Quality vs citric acid

## subtable$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2100  0.2575  0.3450  0.3360  0.3850  0.4700 
## -------------------------------------------------------- 
## subtable$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1900  0.2900  0.3042  0.4000  0.8800 
## -------------------------------------------------------- 
## subtable$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2400  0.3200  0.3377  0.4100  1.0000 
## -------------------------------------------------------- 
## subtable$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.320   0.338   0.380   1.660 
## -------------------------------------------------------- 
## subtable$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.2800  0.3100  0.3256  0.3600  0.7400 
## -------------------------------------------------------- 
## subtable$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.2800  0.3200  0.3265  0.3600  0.7400 
## -------------------------------------------------------- 
## subtable$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   0.340   0.360   0.386   0.450   0.490

As described before, citric acid can add ‘freshness’ and flavour to wines. I was expecting that higher-quality wines would have a higher level of citric acid. However, the mean citric acid level is quite constant across the quality groups (around 0.33), and the variance seems to decrease as the quality increases.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I focused on the correlations between quality and:
* residual sugar: the level of residual sugar is constant among the quality groups. The residual sugar level does not seem to be a major component influencing the wine quality.
* volatile acidity: there is no variation of volatile acidity between the different quality groups.
* citric acid: all the different quality groups have quite the same average citric acid level. With higher quality, the variance is reduced.
* alcohol: there is a correlation between quality and alcohol. The higher the level of alcohol is, the higher the quality is.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a correlation between:
- the alcohol and the total sulfur dioxide
- the alcohol and the chlorides
- the alcohol and the residual sugar

However, we observed two groups in the correlation of alcohol vs residual sugar. These two groups also correspond to the two groups observed in the bimodal distribution of residual sugar, with a limit of 4 g/L.

What was the strongest relationship you found?

The strongest relationship observed was between alcohol and quality.

Multivariate Plots Section

Alcohol and residual sugar tendency depending on the sugar category

In the previous part, we observed that alcohol vs residual sugar seems to show two behaviours, with a limit at 4 g/L of residual sugar. In the first part I created the variable sugar_category, separating the wines into two categories. The following graphic explores whether the behaviour differs depending on the sugar_category.

## # A tibble: 2 x 3
##   sugar_category alcohol.median alcohol.mean
##   <chr>                   <dbl>        <dbl>
## 1 dry                     11.0          11.0
## 2 slight_sweet             9.80         10.2

In the ‘dry’ wines, alcohol increases with residual sugar, while the ‘slight sweet’ wines show the opposite correlation. This suggests that increasing the level of sugar increases the level of alcohol up to a “breaking point” of 4 g/L. After this limit, the sugar acts as an inhibitor of the alcohol. The idea is thus to analyse the variables in the rest of the study by separating these two groups. We observe that the dry wines contain on average more alcohol than the slight sweet wines.

Impact of sugar category on the alcohol vs chlorides

## # A tibble: 2 x 3
##   sugar_category chloride.median chlorides.mean
##   <chr>                    <dbl>          <dbl>
## 1 dry                     0.0400         0.0441
## 2 slight_sweet            0.0450         0.0470

Pearson correlation coefficient for each sugar category

## subtable$sugar_category: dry
## [1] -0.3921068
## -------------------------------------------------------- 
## subtable$sugar_category: slight_sweet
## [1] -0.3367922

Both sugar categories have a negative correlation between alcohol and chlorides, and both have a similar average chloride level. But the Pearson correlation coefficient is stronger (larger in absolute value) in the dry wines than in the sweet wines. This means that the alcohol level of the dry wines might be more influenced by the chlorides than that of the sweet wines.
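
Computing one Pearson coefficient per sugar category amounts to splitting the rows by category and applying the usual formula to each subset. The following Python sketch illustrates this; it is not the R code behind the output above, the helper names pearson and pearson_by_group are assumptions, and the eight rows are hypothetical values chosen only for illustration, not taken from the dataset.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pearson_by_group(rows, group_key, x_key, y_key):
    """Return {group: correlation between x_key and y_key within that group}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append((row[x_key], row[y_key]))
    return {g: pearson(*zip(*pairs)) for g, pairs in groups.items()}


# Hypothetical rows (NOT from the wine dataset), four per category.
rows = [
    {"sugar_category": "dry", "alcohol": 12.0, "chlorides": 0.030},
    {"sugar_category": "dry", "alcohol": 11.0, "chlorides": 0.040},
    {"sugar_category": "dry", "alcohol": 10.0, "chlorides": 0.035},
    {"sugar_category": "dry", "alcohol": 9.0, "chlorides": 0.050},
    {"sugar_category": "slight_sweet", "alcohol": 11.0, "chlorides": 0.040},
    {"sugar_category": "slight_sweet", "alcohol": 10.0, "chlorides": 0.050},
    {"sugar_category": "slight_sweet", "alcohol": 9.5, "chlorides": 0.045},
    {"sugar_category": "slight_sweet", "alcohol": 9.0, "chlorides": 0.048},
]

by_category = pearson_by_group(rows, "sugar_category", "alcohol", "chlorides")
for category, r in by_category.items():
    print(f"{category}: r = {r:.3f}")
```

The same helper could be pointed at other column pairs, such as alcohol and total sulfur dioxide, to compare the strength of a relationship between the two categories.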

Impact of sugar category on the alcohol vs total sulfur dioxide

## # A tibble: 2 x 3
##   sugar_category total.sulfur.median total.sulfur.mean
##   <chr>                        <dbl>             <dbl>
## 1 dry                            117               120
## 2 slight_sweet                   151               152

Pearson correlation coefficient for each sugar category

## subtable$sugar_category: dry
## [1] -0.2257001
## -------------------------------------------------------- 
## subtable$sugar_category: slight_sweet
## [1] -0.4608308

The sweet wines have a higher level of total sulfur dioxide than the dry wines. By separating the sugar categories, we can say that for the dry wines there is no meaningful correlation between alcohol and total sulfur dioxide (|coefficient| < 0.3), while for the sweet wines there is a negative correlation between alcohol and the total sulfur dioxide level.

Alcohol vs quality depending on the sugar category

Pearson correlation coefficient for each sugar category

## subtable$sugar_category: dry
## [1] 0.453321
## -------------------------------------------------------- 
## subtable$sugar_category: slight_sweet
## [1] 0.4292139

Both sugar categories have a positive correlation between alcohol and quality. The Pearson correlation coefficient is slightly higher in the dry wines (0.45) than in the sweet wines (0.43). This means that the quality of the dry wines might be somewhat more influenced by the alcohol than the quality of the sweet wines. The presence of sugar might “hide” the difference in quality between wines in the sweet category, or the decrease of alcohol observed in the sweet wines might make these wines of lower quality.

Impact of chlorides on the alcohol vs quality in each sugar category

Previously, we saw that the influence of the chlorides on the alcohol is more pronounced in the dry wines than in the sweet wines. In this graphic, we see a clearer separation of the chloride levels in dry wines than in sweet wines. Because alcohol and chlorides are negatively correlated, we see that for the same quality in dry wines, the wines with higher chlorides have a lower alcohol level.

Impact of total sulfur dioxide on the alcohol vs quality in each sugar category

Previously, we saw that the influence of the total sulfur dioxide on the alcohol is more pronounced in the sweet wines than in the dry wines. In this graphic, we see a clearer separation of the total sulfur dioxide levels in the sweet wines than in the dry wines. Because alcohol and total sulfur dioxide are negatively correlated, we see that for the same quality in sweet wines, the wines with higher total sulfur dioxide have a lower alcohol level.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We observe that in this dataset there are two types of behaviour depending on the sugar level. As previously observed, the alcohol level is correlated with the quality, which is the strongest correlation found with quality. The correlation is stronger in dry wines. Dry wines have more alcohol and, in this category, the chloride level has an important negative influence on the alcohol. Sweet wines have less alcohol and, in this category, the total sulfur dioxide level has an important negative influence on the alcohol.

The highest correlation value is 0.45, for alcohol vs quality in dry wines. This is not a high correlation level, so we cannot use alcohol as a parameter for quality prediction.


Final Plots and Summary

Plot One: 2 types of wines depending on the residual sugar

Description One

By plotting the distribution of residual sugar, we have observed that the wines can be split into two groups: “dry” and “slight sweet” wines, with a limit of 4 g/L. We have found that each group has a different behaviour.

Plot Two: Alcohol vs quality depending on the sugar category

Description Two

Alcohol level and quality are correlated in both the dry and the slight sweet groups, with coefficients of 0.45 and 0.43 respectively. It means that the more alcoholic the wine is, the better the rater tends to find it. The influence of alcohol is slightly more pronounced in the dry wines. However, a coefficient of about 0.4 is not a high correlation level, so we cannot use alcohol as a parameter for quality prediction.

Plot Three: Influence of variables on Quality vs Alcohol

Description Three

For each sugar category, we have observed other variables that might influence the level of alcohol for a given quality. For the dry wines, a high level of chlorides seems to go with a lower level of alcohol. For the slight sweet wines, a high level of total sulfur dioxide seems to go with a lower level of alcohol.


Reflection

The analysis of this dataset of white wines leads us to these conclusions:
* There are 2 groups of wines based on their residual sugar.
* Increasing the amount of residual sugar increases the level of alcohol until a breaking point. After this breaking point, the addition of sugar acts as an inhibitor of the alcohol.
* The dry white wines contain more alcohol than the slightly sweet white wines.
* The chlorides decrease the alcohol level, with a more pronounced effect in the dry wines.
* The total sulfur dioxide decreases the alcohol level, with a more pronounced effect in the slightly sweet white wines.
* Alcohol level and quality are positively correlated, with a stronger effect in the dry wines.
* Surprisingly, the volatile acidity level, the residual sugar level and the citric acid level do not have an influence on the quality.

The level of alcohol and residual sugar can be controlled during the production process, and the sulfur dioxide is added. However, the chloride concentration in the wine is influenced by terroir [ref]. The idea would be to add sugar step by step so as to stay below the breaking point, producing the highest possible level of alcohol along the way. At the same time, reducing the amount of sulfur dioxide could improve the quality of the wine. However, we can conclude that the experts’ quality rating is mostly based on their personal taste or could depend on other variables like the year of production, the grape types or the terroir.

For further exploration, the same analysis could be done on red wines and the results compared with this white wine dataset.